Communicated by Tadamitsu Kishimoto, August 8, 1994

ABSTRACT We have previously reported that in four 
transcription factor families the DNA-recognition rules can be 
described as (i) chemical rules, which list possible pairings
between the 20 amino acid residues and the four DNA bases,
and (ii) stereochemical rules, which describe the base and
amino acid positions in contact. We have incorporated these
rules into a computer program and examined the nature of the
rules. Here we conclude that the DNA recognition rules are 
simple, logical, and consistent. The rules are specific enough to 
predict DNA-binding characteristics from a protein sequence. 



A large number of transcription factors, which play dominant 
roles in transcription regulation by binding to different DNA 
sequences, have been identified. Since the three-dimensional 
structure of a protein is uniquely fixed by its amino acid 
sequence, basic rules are expected, which would predict the 
DNA-binding specificity from a transcription factor sequence. 
But, since the initial expectation of such rules (the recognition 
code) (1), many structural biologists have expressed skepti- 
cism about their existence (for example, see ref. 2). 

The crystal structures of a number of transcription factor- 
DNA complexes have been determined (3-27); also a consid- 
erable amount of biochemical, genetic, and statistical infor- 
mation about the binding specificity of transcription factors is 
available (28-34). By using these data, we have devised a 
method of analyzing the patterns of contacts between DNA 
bases and amino acid residues (35-40) and have described the 
DNA-recognition rules of four transcription factor families: 
the probe helix (PH), which includes homeo and zipper 
proteins (35, 36); the helix-turn-helix (HTH) (M.S. and M. 
Gerstein, unpublished results); the zinc finger (ZnF) (37, 38); 
and the C4 Zn-binding proteins (C4), which include hormone 
receptors and GATA proteins (38-40). These rules concern 
contacts from amino acid side chains in a recognition helix to 
DNA bases in the major groove. 

The aim of this paper is to establish a framework of 
DNA-recognition rules common to the four families and to 
examine whether, from the nature of the rules, they consti- 
tute a recognition code. 

Framework of the DNA Recognition Rules 

The DNA-recognition rules are of two types, chemical and 
stereochemical. The chemical rules list possible pairing part- 
ners of amino acid side chains and DNA bases through 
hydrogen bonding or hydrophobic interaction (Fig. 1a; ref.
36). The sizes of residues are also important; from a fixed 
position on an interaction surface, a longer side chain can 
reach a more distant part of the DNA. The residues are 
classified roughly into four groups: small, medium, large,
and aromatic (Fig. 1a; ref. 36). These chemical rules are
general for any binding motif. 

The inclination of the recognition helix in the major groove 
of DNA is fixed by the structural elements specific to a 
DNA-binding motif. For instance, a recognition helix of PH 
has conserved Arg/Lys positions, which bind to DNA phos- 
phates and thereby fix the binding geometry (35, 36). As a 
consequence, each binding motif uses a set of particular 
amino acid positions for base recognition. These can be easily 
summarized into a chart with specifications of the sizes of 
residues used; each DNA-binding motif has its own specific 
stereochemical chart (Fig. 1 b-e). ZnF motifs can be subdi- 
vided into two groups (37), but here only the larger group is 
discussed (A fingers). 

Binding Score 

We have incorporated the rules into a computer program, 
which is written in the C programming language and imple- 
mented under the Unix operating system. Its core function is 
to score the match between the given DNA and protein 
sequences. This binding score is essentially the number of 
contacts predicted between the two sequences and reflects 
the binding energy. 

To calculate the binding score, points for stereochemical 
(see the legend to Fig. 1 b-e) and chemical (Fig. 1a) merits are
introduced. The binding score is calculated as the sum over
all the contacts of (stereochemical merit point) × (chemical
merit point) for each interaction. The chemical merit points
given to different base-residue partners are not always the
same (Fig. 1a). For instance, Arg and Lys could bind by a
hydrogen bond to T, G, or A. But in fact they recognize the 
G base almost exclusively (36), because the G base in a G-C 
pair is electrically polar (negatively charged), while Arg and 
Lys have a positive charge. Therefore, binding of Arg or Lys 
to G should be given more points than to T or A. Similarly, 
not all the contacts in the stereochemical charts appear to be 
equally important (refs. 36 and 37; M.S. and M. Gerstein, 
unpublished results), and this is reflected in differences in the 
two grades of stereochemical merit points (see contacts 
marked with diamonds and those not in Fig. 1 b-e).

Often several different sets of contacts are possible for 
given protein and DNA sequences. In this case, the pairing 
with the highest score is chosen. However, it is stereo- 
chemically forbidden to make two contacts that cross over 
each other in the chart. For instance, in Fig. 1c aa 5 can
contact C3, and aa 8 can contact C2, but not simultaneously. 
As an example, the binding score of CAP (Fig. 2h) is the sum 
of the products of the chemical and stereochemical points for 



Abbreviations: PH, probe helix; HTH, helix-turn-helix; ZnF, zinc
Fig. 1. Chemical (a) and stereochemical (b-e) rules that make the
DNA-recognition code and code tables for C4 (f) and ZnF (g). (a) The
chemical merit points are also shown. Residues in boldfaced letters are
those important for specificity (specific residues). (b-e) Sketches of
the DNA major groove with the bases, W1-W4 (top strand) and C1-C4
(bottom strand), to which a recognition helix (in the middle line) binds.
The sizes of residues, small (s), medium (m), and large (l), used for the
contacts are also shown. Aromatic residues may often be included
with the large group. Ten stereochemical merit points are given to the
contacts marked with diamonds and five to the other contacts. No
stereochemical points are given otherwise. If a hydrophobic interaction
takes place to a T base and if one of the two neighboring bases is
another T, an additional 3 points are added to the chemical merit point,
since this is likely to enhance the hydrophobic environment. The
binding specificity of Asn (aa 1) of PH is affected by Asn (aa 2) through
side chain-side chain interactions (36); if Asn occupies position 2, Asn
(aa 1) interacts with Asn (aa 2) and binds to A (W2), but if not, Asn (aa
1) bridges the C1 and W2 bases. For this reason, if position 2 is
occupied by Asn, the chemical merit point of Asn (aa 1) to A (W2) is
kept at 15; if not, it is decreased to 10 and the residue is allowed to bind
to the C1 base at the same time. When a single residue binds to two
bases simultaneously, the two contacts are handled independently.
This is to simplify the computer program, although the two bases
bridged in this way are limited and can be handled as a set (36). The
code tables (f and g) are made by choosing the columns from a
according to the residue sizes specified in d and e. The interaction of
hydrophobic residues to the C base is weaker and therefore is shown
by plain instead of boldfaced letters. Position 3 in ZnF can be occupied
by a medium or large residue, but a medium residue is preferable (37);
the large residues are shown in the parentheses.

the Arg-G, Arg-G, and Glu-C contacts, respectively: (10 × 15) +
(5 × 15) + (10 × 12) = 345.
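
The scoring procedure described in this section can be sketched in a few lines of Python. This is only an illustration, not the authors' C program: the `Contact` record, the crossing test (contacts are taken to cross when their order along the helix disagrees with their order along the chart), and the exhaustive search are assumptions of the sketch. The merit points of the CAP example come from the text, while the residue and base positions attached to them are made-up placeholders.

```python
from itertools import combinations
from typing import NamedTuple


class Contact(NamedTuple):
    residue_pos: int    # position of the amino acid along the recognition helix
    base_pos: int       # position of the base along the chart (hypothetical coordinate)
    stereo_points: int  # 10 for diamond-marked contacts, 5 for the others
    chem_points: int    # chemical merit point of the residue-base pair


def binding_score(contacts):
    """Sum of (stereochemical merit point) x (chemical merit point)."""
    return sum(c.stereo_points * c.chem_points for c in contacts)


def crosses(a, b):
    """Assumed crossing test: helix order and chart order disagree."""
    return (a.residue_pos - b.residue_pos) * (a.base_pos - b.base_pos) < 0


def best_contact_set(candidates):
    """Exhaustively pick the highest-scoring subset with no crossing pair."""
    best, best_score = (), 0
    for size in range(1, len(candidates) + 1):
        for subset in combinations(candidates, size):
            if any(crosses(a, b) for a, b in combinations(subset, 2)):
                continue
            score = binding_score(subset)
            if score > best_score:
                best, best_score = subset, score
    return best, best_score


# CAP example: merit points are those quoted in the text,
# the positions are placeholders chosen so that nothing crosses.
cap_contacts = [
    Contact(1, 1, 10, 15),  # Arg-G
    Contact(2, 2, 5, 15),   # Arg-G
    Contact(3, 3, 10, 12),  # Glu-C
]
print(binding_score(cap_contacts))  # 345
```

With only a handful of candidate contacts per recognition helix, the exhaustive search is cheap; a dynamic-programming search would be the natural replacement if the candidate list were long.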

Consistency and Specificity of the Rules 

The DNA recognition rules were originally deduced from 25 
crystal structures (3-27) and many other transcription factors 



Proc. Natl. Acad. Sci. USA 91 (1994)

whose binding specificity has been characterized by genetic
or biochemical experiments (see the references cited in refs.
35-40). 

Contacts were predicted by the program for 73 recognition 
helices: those of 10 PH proteins, 20 HTH proteins, 38 ZnF 
proteins (specific or very specific A fingers listed in ref. 37), 
and 5 C4 proteins (selected examples are shown in Fig. 2). 

In most examples, the predicted contacts are essentially 
the same as those observed or predicted in earlier work. Thus 
the rules can consistently explain the amino acid-base con- 
tacts. However, this does not necessarily suggest that the 
rules can explain how factors discriminate between the target 
and other DNA sequences; if many other DNA sequences 
were recognized by a factor in similar ways, the factor could 
not choose the correct site. We now examine this aspect 
(specificity) of the rules in two ways. 

We first compare the binding score given to the real binding 
site with those for sites consisting of all other possible base 
combinations (Fig. 3). HTH, C4, and ZnF recognize four
base pairs, which have 256 possible combinations. PH rec- 
ognizes three base pairs, and the number of combinations is 
64. In our calculation, the real binding sequence is usually 
found among a small number of DNA sequences that score 
the highest (Fig. 3); the rules are sufficiently specific to 
exclude the rest of the DNA sequences, which score less. To 
evaluate the specificity of the rules, we introduce the specificity
index, which is defined as (100 - n - m/2)%, where n is
the percentage of the DNA sequences that score higher than
the real binding sequence and m is that of the DNA sequences
that score the same as the real binding sequence. If a factor
has two natural binding sequences (sequence i, which scores
higher than sequence j), n is defined as the percentage that
scores higher than i, and m is defined as the percentage that
scores between i and j. The average indices calculated for
the factors are 93% (PH) (96% if Max is excluded, which is 
further discussed in M.S. and M. Gerstein, unpublished 
results), 99% (C4), 96% (ZnF), and 92% (HTH). 
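
The enumeration behind the specificity index can be illustrated as follows. The sketch assumes that tied sequences are counted at half weight (the `tie_weight` parameter, which can be changed), that percentages are taken over all possible sites of the same length, and that the target itself is not counted as a tie. The `toy_score` function is a placeholder and does not implement the recognition rules.

```python
from itertools import product


def specificity_index(score, target, tie_weight=0.5):
    """Return 100 - n - tie_weight * m for one target sequence (in percent)."""
    all_sites = ["".join(p) for p in product("ACGT", repeat=len(target))]
    target_score = score(target)
    higher = sum(1 for s in all_sites if score(s) > target_score)
    tied = sum(1 for s in all_sites if s != target and score(s) == target_score)
    n = 100.0 * higher / len(all_sites)  # percentage scoring higher
    m = 100.0 * tied / len(all_sites)    # percentage scoring the same
    return 100.0 - n - tie_weight * m


def toy_score(site, preferred="GTGA"):
    """Placeholder score: 10 points per match over the first three positions,
    so the fourth base is deliberately not discriminated."""
    return sum(10 for a, b in zip(site[:3], preferred[:3]) if a == b)


print(specificity_index(toy_score, "GTGA"))
```

For a four-base-pair site the loop visits 256 sequences, and for a three-base-pair site 64, matching the counts given in the text.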

As a second test we now examine the DNA sequence of a 
region regulated by a transcription factor in vivo. When the 
binding score is calculated for every four base pairs along the 
DNA, shifting one base pair at a time, the highest score is 
given for the experimentally identified binding site (Fig. 4). 
Since DNA has two strands, the score must be calculated 
along each of the two strands. 
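
The scan described above amounts to a sliding window evaluated on both strands. In the sketch below, the window width, the function names, and the convention of reporting the bottom-strand score at the same start index as the top-strand window are assumptions; `toy_score` again stands in for the rule-based score.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def scan_both_strands(dna, score, width=4):
    """Score every window of `width` bases, shifting one base at a time."""
    starts = range(len(dna) - width + 1)
    top = [score(dna[i:i + width]) for i in starts]
    bottom = [score(reverse_complement(dna[i:i + width])) for i in starts]
    return top, bottom


def toy_score(site, preferred="GTGA"):
    """Placeholder score: number of positions matching a preferred site."""
    return sum(1 for a, b in zip(site, preferred) if a == b)


top, bottom = scan_both_strands("AAGTGACCTCACAA", toy_score)
print(top.index(max(top)), bottom.index(max(bottom)))  # 2 8
```

In this made-up sequence the preferred site occurs once on each strand, so each profile has a single peak.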

The above two tests have shown that the rules are highly 
specific. In the crystal structures, some additional contacts 
are seen from outside a recognition helix, but the binding 
specificity of a recognition helix seems to be essentially 
sufficient to specify uniquely the DNA-binding sites. 

Spacing Type 

An α-helix can bind to no more than five base pairs because
of the curvature of the DNA major groove; it can access only 
one side of the DNA (44). To recognize more than five base 
pairs, two or more helices are used in combination, essen- 
tially by either relating the two with a twofold symmetry axis 
or repeating them in tandem. The classic HTH proteins and 
zipper proteins of the PH family use "symmetrical" arrange- 
ments (denoted here as S), while ZnF proteins use a "tan- 
dem" arrangement (denoted here as T). C4 proteins use both 
types of arrangements (45). 

Symmetrical arrangements can be characterized by whether 
the C terminus (denoted with the "+" sign) or the N terminus
(denoted with the "-" sign) is closer to the dyad axis and the
number of bases along the DNA between the two binding sites 
(for example, S +6 for the HTH protein CAP). By knowing the 
spacing type, the plot of the binding score can be improved. 
When the binding scores of the two DNA strands for CAP 
binding are shifted by six base pairs and added to each other, the 
Fig. 2. Comparison between contacts observed in the crystal structures (a, c, e, g, i, and k) and computer-predicted contacts (b, d, f, h,
j, and l). The figures are drawn in the same way as in Fig. 1. The dotted line (····) in j shows an additional predicted hydrophobic interaction
to the neighboring T base. A pair of two dashed lines (- - -) in f show two alternative contacts with the same score. The contacts that are predicted
but not observed and those observed but not predicted are marked with circles. The side chain of Asn (aa 1) in Matα2 (k) is not well described
in the original report of the crystal structure (4). The residue is predicted to contact the C (C1) and T (W2) bases (l). Leu (aa 4) of Max is predicted
to make contacts with C (C1) or C (C2) (f). The figures of the original report (5) show that this leucine does seem close to C (C1), but the
coordinates have not been published and the paper does not mention this contact.



new plot shows a clearer peak (Fig. 4e). Thus, a weaker binding 
specificity of a HTH recognition helix (see the previous section) 
is compensated by combining two such helices. 
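
Combining the two strand profiles for a symmetrical arrangement can be sketched as below. The indexing is an assumption of this sketch: profiles are indexed by window start, so a spacer of six bases between two four-base-pair sites corresponds to an offset of `width + spacer` between the window starts, and the half-site read on the top strand is taken to lie to the left. The function name and parameters are illustrative only.

```python
def combine_symmetric(top, bottom, width=4, spacer=6):
    """Add the top-strand score at i to the bottom-strand score of the
    partner half-site, which starts `width + spacer` bases downstream."""
    offset = width + spacer
    return [top[i] + bottom[i + offset] for i in range(len(top) - offset)]


# Made-up profiles: one good half-site on each strand, correctly spaced.
top = [0] * 15
bottom = [0] * 15
top[1] = 4
bottom[11] = 4

combined = combine_symmetric(top, bottom)
print(combined.index(max(combined)), max(combined))  # 1 8
```

A half-site that scores well by chance on one strand only contributes half of the combined peak, which is why the summed profile is sharper than either single-strand profile.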

The spacing type of the majority of ZnF proteins is T -1 
[i.e., two neighboring fingers share one base pair (-1) in a 
tandem (T) arrangement (37)]. A single finger appears to be 
incapable of discriminating between DNA sequences, but the 
combination of two or three fingers does seem to be sufficient 
(see figure 9 of ref. 37). This can explain why fingers are 
always found in a repeat. 

The two experimentally identified ADR1 (ZnF)-binding 
sites in its regulatory DNA region are predicted successfully 
(Fig. 4c). The two sites are likely to be recognized by a 
symmetrical dimer of ADR1 molecules, each of which has 
two ZnF motifs in tandem (T -1), with the superspacing type 
of S +6 (Fig. 4c). Therefore, the communication between 
DNA and proteins can be described with increasing accu- 
racy, from the chemical, the stereochemical, the spacing to 
the superspacing levels. 



Prediction and Design 

Our computer program successfully identifies the binding 
sites of transcription factors whose binding specificities 
have been characterized experimentally. Therefore, it may 
be natural to expect that it can (i) predict the yet unknown
binding specificity of a protein sequence and (ii) design a
factor that would recognize a particular DNA sequence. 

In the ZnF and C4 families, a simple table relating DNA 
and protein sequences can be produced (Fig. 1 f and g; ref.
38). Three residues of C4 — 1, 5, and 9 — bind to the three 
consecutive bases W2-W4, by a simple one residue-one base 
relationship, while ZnF positions -1, 3, and 6 bind to 
W2-W4. Therefore, by choosing specific partner residues in 
the correct columns from Fig. 1a according to the amino acid
sizes shown in Fig. 1 d and e, recognition tables for the three 
positions from two types can be constructed (see ref. 38 for 
further discussion). 
Fig. 3. Distribution of the binding scores for Zif268 finger 3 (ZnF) (a), estrogen receptor (C4) (b), CAP (HTH) (c), Matα2 (PH) (d), and TEF

Fig. 4. Prediction of the binding sites for factors: estrogen receptor (C4) (a and d), CAP (HTH) (b and e), and ADR1 (ZnF) (c and f). (a-c)

The rules will be further improved as information becomes
available. For example, in this study, changes in the DNA 
structure upon binding proteins and the sequence-dependent 
differences in the DNA structures are ignored. However, the 
framework and the major features of the rules are unlikely to 
change. We have shown that the DNA-recognition rules for 
well-characterized factors in the four families are simple, 
logical, consistent, and specific. We therefore believe that 
these rules constitute the DNA-recognition code. 
ABSTRACT

Genes encoding sequence-specific DNA 
binding proteins can be isolated by 
screening λgt11 expression libraries with
recognition site DNAs. This strategy is 
derived from that developed for the isola- 
tion of genes using antibody probes. Many 
different genes encoding transcriptional 
regulatory proteins have been cloned 
using this strategy. The DNA binding 
domains of these regulatory proteins con- 
tain different structural motifs including 
the helix-turn-helix, the "zinc finger" and
the "leucine zipper". Various aspects of 
the screening strategy are evaluated and a 
detailed protocol is provided. In addition 
to binding site DNAs, protein and 
nucleotide probes have been successfully 
used to screen expression libraries. There- 
fore ligand based expression screening 
may be quite general in scope. 
INTRODUCTION

Sequence-specific DNA binding 
proteins play a central role in decipher- 
ing the structural and regulatory infor- 
mation encoded in cellular and viral 
genomes. They function to initiate as 
well as control the transcription, 
replication and site-specific recombin- 
ation of DNA sequences. Biochemi- 
cally, these proteins determine the 
specificity and reactivity of enzymatic
assemblies that act on DNA.

In genetically tractable prokaryotes 
and eukaryotes, most sequence-spe- 
cific DNA binding proteins have been 
identified as the products of trans-ac- 
ting regulatory loci. In many complex 
eukaryotic organisms a similar ap- 
proach to their identification has not 
been possible. Instead, the recent ap- 
plication of sensitive DNA binding as- 
says, in particular, DNase I footprint- 
ing (13) and gel electrophoresis of 
protein-DNA complexes (12,14), has 
led to the detection and characteriza- 
tion of numerous sequence-specific 
DNA binding proteins. A majority of 
these proteins bind selectively to dis- 
tinct transcriptional control elements 
and are thereby implicated in regulat- 
ing the activity of their target genes 
(31). The isolation of recombinant 
clones encoding such proteins would 
facilitate a genetic and biochemical 
analysis of their structural and func- 
tional properties. Prior to the applica- 
tion of the cloning strategy described 
below, genes encoding sequence- 
specific DNA binding proteins could 
be isolated only by screening recom- 
binant DNA libraries with antibody 
(28,49,50) or oligonucleotide probes 
(2,25,49). The latter are generated from
partial amino acid sequences of the 



relevant proteins. Both screening 
strategies are dependent on the availability
of substantial amounts of the
purified protein. Even though the
purification of sequence-specific DNA
binding proteins has been greatly
facilitated by the development of improved
DNA-affinity matrices (4,24,40),
the requirement for very large
amounts of starting material (tissue or
cells) makes purification on a preparative
scale difficult. The new strategy
obviates purification of a sequence-specific
DNA binding protein for the
purpose of isolating its gene. It simply
requires an appropriate recombinant
DNA library constructed for expression
in Escherichia coli and a DNA
recognition site probe. Therefore, this
strategy is ideally suited for isolating
clones encoding rare regulatory
molecules.

CLONING STRATEGY 

The cloning strategy depends on the 
functional expression in E. coli of high
levels of the DNA binding domain of a 
regulatory protein and a strong interac- 
tion between this domain and its recog- 
nition site. If these conditions are ful- 
filled, a recombinant clone encoding a 
sequence-specific DNA binding pro- 
tein can be detected by probing protein 
replica filters of an expression library 
with radiolabeled recognition site 
DNA. An outline of the steps involved 
in identifying and analyzing such a 
clone, using a recombinant library constructed
in the expression vector λgt11,
is depicted in Figure 1. The initial
phase involves the identification of a
recombinant clone that is specifically 
detected with the binding site DNA 
probe (X) but not with DNA probes 
that lack the given binding site or contain
a mutant version of it (Y), see
Figure 1. Such a clone is then shown to
encode a β-galactosidase fusion protein
of the expected DNA binding specificity.
This strategy is derived from that
developed for the isolation of genes 
using antibodies to screen recombinant 
expression libraries (19^2^3). 

Using a 32P-labeled recognition site
DNA probe with a specific activity of
10[exponent illegible] cpm/pmol (ca. 10[exponent illegible] cpm/µg), it is
possible to detect 10^-2 fmol of active
protein in a plaque (assuming a 1:1
stoichiometry for the protein-DNA
complex). This detection limit represents
1 pg of a β-galactosidase fusion
Figure 1. Outline of the strategy for the
molecular cloning of sequence-specific DNA
binding proteins using the expression vector
λgt11. X is a recognition site DNA probe
whereas Y is a control DNA probe that lacks the
given recognition site or contains a mutant ver-

tion of λgt11 recombinants that are specifically
detected with DNA probe X (λX). After plaque
purification, the gel electrophoresis DNA binding
assay is used to analyze extracts of λX and
λgt11 (λ) lysogens. Radiolabeled X-DNA is used
as a probe in the binding reactions. F and B refer
to free and bound X-DNA, respectively. Reactions
in lanes +X and +Y are carried out with the
λX extracts and contain an excess of either unlabeled
X-DNA or unlabeled Y-DNA as competitors.
protein (ca. Mr 170,000), which is an
amount that is well below the expected
level of expression for such a protein in
a plaque of the desired recombinant
λgt11 phage. In fact, overexpression of
the lacZ fusion gene should result in
the accumulation of ca. 100 pg of the
fusion protein in a phage plaque, assuming
that there are 10^5 infected
cells/plaque and that the β-galactosidase
fusion protein represents 1%
of the total protein mass (0.1 pg) of an
infected cell. The sensitivity of detection
achieved with a 32P-labeled recognition
site probe (see above) is comparable
to that attained with an
125I-labeled primary antibody (3) or a
detection system based on a secondary
antibody conjugated with alkaline
phosphatase (29). A comparison of the
signals generated by a DNA binding
site probe and an antibody directed
against the corresponding protein is illustrated
in Figure 2. The λgt11 recombinant
(λEB) encodes a β-galactosidase
fusion protein that contains the
DNA binding domain of the Epstein-Barr
virus nuclear antigen EBNA-1
(38,44). A protein replica filter
prepared from a mixed plating of λEB
and control λgt11 recombinant phage
was screened initially with a recognition
site DNA probe (oriP) that contains
two high affinity binding sites for
EBNA-1 and subsequently with antibodies
directed against EBNA-1. In



this case, the higher signal obtained 
with the DNA binding site probe was 
attributed to a less sensitive secondary 
antibody conjugate containing horseradish
peroxidase (29) used in immuno-screening.
Note that the patterns
of plaques detected by the two types of 
probes are superimposable. Therefore, 
a DNA binding site probe can be used 
to detect a suitable recombinant phage 
with the same fidelity as an antibody. 
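
The sensitivity argument in this section is a short unit conversion, sketched below. The input values (10^-2 fmol detectable, 10^5 infected cells per plaque, 0.1 pg of protein per cell, 1% of it as fusion protein) are assumptions of this sketch, chosen because they reproduce the figures of about 1 pg and about 100 pg quoted in the text; the function name is illustrative.

```python
def mass_pg(amount_fmol, relative_molecular_mass):
    """Convert an amount in femtomoles to a mass in picograms."""
    grams = amount_fmol * 1e-15 * relative_molecular_mass
    return grams * 1e12


# Detection limit: 10^-2 fmol of a fusion protein of Mr ca. 170,000.
detection_limit_pg = mass_pg(1e-2, 170_000)

# Expected yield per plaque: cells x protein per cell (pg) x fusion fraction.
expected_pg = 1e5 * 0.1 * 0.01

margin = expected_pg / detection_limit_pg
print(round(detection_limit_pg, 2), round(expected_pg, 1), round(margin))
```

Under these assumptions the expected amount of fusion protein exceeds the detection limit by well over an order of magnitude, which is the point the text makes.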



SCREENING OF EXPRESSION 
LIBRARIES 

Using screening conditions 
developed with a model system, Singh 
et al. (44) isolated a cDNA clone that 
encodes an enhancer binding protein 
(H2TF1/NFκB in Table 1). This human
cDNA clone was detected by screening
a λgt11 expression library with a binding
site probe derived from the enhancer
of a major histocompatibility complex
(MHC) class I gene. The
recombinant clone successfully satisfied
the criteria depicted in Figure 1. It
was detected only with the wild type
MHC element (GGGGATTCCCC)
probe but not with control DNAs that
lack the MHC element or contain a
mutant version. Secondly, it specified a
β-galactosidase fusion protein which
bound specifically to the MHC element
in a gel mobility assay. The binding



[Figure 2 legend fragment: ... gift of G. Milman, Johns Hop...]



Table 1

Protein          Binding Site            Genes/Genomes That Contain the Binding Site
H2TF1/NFκB*      GGGGATTCCCC             MHC I, β2m, Igκ, SV40, HIV, IL-2R, β-IFN
NF-A2 (Oct-2)    ATGCAAAT
NF-A1 (Oct-1)    ATGCAAAT
E12              GGCAGGTGG
XBP              ND
RF-X             CCCCCTAGCAACAG
YB-1             GACTAACCGGTTT
IRF-1            AAGTGA
PRDI-BF          GAGAAGTGAAAGTG
Pit-1            GATTACATGAATATTCATGA
MLTF             CACGTGACCG

site was further delineated by methylation
interference analysis of the
protein-DNA complex. The isolation
of this clone validated the various assumptions
on which the screening
strategy is based. It also provided the
impetus for its application in the isolation
of other clones encoding sequence-specific
DNA binding proteins.
The isolation of a clone encoding a
lymphoid-specific octamer binding
protein (NF-A2 (Oct-2) in Table 1)
(7,46) demonstrated the usefulness of
two modifications. In this case, a multi-site
DNA probe, consisting of four
copies of a 26 bp oligonucleotide containing
the octamer motif
(ATGCAAAT), was employed. This increased
the sensitivity of detection of
the relevant recombinant phage (see
below). Furthermore, in this screen,
sonicated and denatured calf thymus
DNA was used as a nonspecific competitor
instead of poly(dI-dC)·poly(dI-dC).
This substitution reduced the
number of inappropriate recombinant
phage that were detected (see below).

Vinson et al. (48) have described a
third modification in which the nitrocellulose
replica filters were subjected
to a denaturation/renaturation regimen
prior to screening. This treatment enhanced
the sensitivity of detection of a
phage λ recombinant encoding the enhancer
binding protein (C/EBP) (see
below). This report also demonstrated
enhancement of the detection signal 
with a multi-site DNA probe. 

In the year following the initial application
of this strategy, a large number
of mammalian cDNA clones encoding
distinct sequence-specific DNA
binding proteins have been isolated
(Table 1). All of these proteins appear
to represent transcription factors which
regulate the activity of different


promoters and enhancer elements (see 
Table 1). These examples facilitate the 
evaluation of different aspects of the 
screening strategy. 

Construction of Expression Library 

cDNA synthesis and cloning. Successful
screening is critically dependent
on the frequency with which functional
recombinants (in-frame fusions
of the DNA binding domain with a
bacterial protein segment) are represented
in a given cDNA expression
library. The cDNA library should be
made from mRNA isolated from a cell
or tissue source with the highest levels
of the desired DNA binding protein.
First-strand cDNA synthesis should be
carried out using random primers
rather than oligo(dT), since the DNA
binding domain may be encoded in the
amino-terminal part of the desired
protein (5' end of the corresponding
mRNA). Adaptors rather than linkers
are preferred for ligating the cDNA inserts
to the vector, since they avoid
digestion of the cDNA with a restriction
enzyme (18,51). It should be noted
that most commercially available
cDNA expression libraries are constructed
using Eco RI linkers. Some of
these libraries contain a high frequency
of partial cDNA inserts that are flanked
by natural Eco RI sites, indicating inefficient
protection of internal sites during
their construction (K. LeClair, personal
communication). This can result
either in the disruption of a cDNA segment
encoding a DNA binding domain
or in a decrease of the frequency of
recombinants containing in-frame
fusions of the DNA binding domain
with the bacterial protein segment.
Expression vectors. The phage
vector λgt11 appears most suitable for
expression screening. It offers the advantages
of high cloning efficiency, the
expression of relatively stable β-galactosidase
fusion proteins and a simple
means of preparing protein replica filters.
Recently, a new bacteriophage λ
expression vector (λZAP) has been
described which obviates subcloning of
cDNA inserts into plasmid vectors for
their analysis (42). The presence of
multiple cloning sites makes possible
the use of "forced cloning" strategies
for expression of cDNA inserts from its
lac promoter. Unlike λgt11, λZAP expresses
fusion proteins containing a
small amino terminal segment of β-galactosidase.
Therefore, the stability
of λZAP encoded fusion proteins may
be different from their counterparts encoded
in λgt11.

Plasmid expression vectors can also 
be used to detect clones encoding se- 
quence-specific DNA binding proteins. 
Figure 3 shows that E. coli colonies 
harboring either an EBNA-1 or bacteriophage
λ O protein-expressing
plasmid can be specifically detected 
using the corresponding binding site 
DNA probes. Even though phage vec- 
tors are advantageous for most cloning 
applications, plasmid vectors could be 
used to rapidly generate, screen and 
analyze recombinants encoding mutant 
DNA binding domains.

Preparation of Protein Replica 
Filters 

Protein replica filters suitable for
screening with DNA recognition site
probes are most easily prepared using a
series of steps derived from the immuno-screening
protocol (22, see accompanying
protocol). This simple
procedure has permitted the detection
of many clones encoding different
DNA binding proteins, e.g. H2TF1/NFκB,
Oct-2, E12, XBP, YB-1, IRF-1,
MLTF, CREB (see Table 1). Vinson et
al. (48) have shown that processing
dried nitrocellulose replica filters
through a denaturation/renaturation
cycle, using 6 M guanidine hydrochloride,
significantly enhances the signal
from a λgt11 recombinant encoding
C/EBP (see accompanying protocol).
However, it is not possible from this
report to directly compare the sensitivity
of the two protocols in detecting
the C/EBP phage, since with the
former the replica filters are not dried.
The denaturation/renaturation cycle
may increase the detection signal by
facilitating the correct folding of a
larger fraction of the E. coli-expressed
protein. Alternatively, it may help to
dissociate insoluble aggregates of the
fusion protein that form as a consequence
of overexpression. This modified
procedure has been successfully
used to isolate clones encoding Oct-1,
Pit-1, PRDI-BF and RF-X (see Table
1). This modification allows the rescreening
of the same replica filter with
a different DNA probe by repeating the
denaturation/renaturation cycle, since
the second denaturation step results in
dissociation of the DNA probe bound
in the first screen.

Screening of Protein Replica Filters 

Recognition site DNA probe. The
highest affinity site among a set of
related sequences should be chosen for
the synthesis of an oligonucleotide
probe. It has been demonstrated that
DNA probes containing a single
recognition site can be used to isolate
the relevant DNA binding protein clones
(H2TF1/NFκB, XBP, YB-1 and
MLTF, see Table 1). However, in a
number of cases (Oct-2, Oct-1, E12,
see Table 1), the signal was
appreciably enhanced with DNA probes
containing several copies of the
appropriate binding site. This effect is
demonstrated in Figure 4 with the
recombinant phage encoding H2TF1/
NFκB (λh3). In this case, the multi-site
probe (trimer) was generated by
cloning three tandem copies of a 25 bp
long oligonucleotide containing the
H2TF1/NFκB binding site
(GGGGATTCCCC). When equivalent protein
replica filters are screened with either
the 1-mer (monomer) or the 3-mer
(trimer) probe (each end-labeled with
³²P to the same specific activity), the
latter generates a 3-5 fold higher
signal. Multi-site probes can also be
prepared for screening simply by
catenation of a binding site
oligonucleotide with DNA ligase,
followed by "nick translation" (48).
Such a probe was used to isolate the
cDNA encoding Pit-1 (23).
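
As a rough illustration of the multi-site probe described above, the
following Python sketch assembles a 25 bp oligonucleotide around the
binding site quoted in the text and concatenates three tandem copies.
The flanking bases are not given in the source, so the poly-A filler and
the helper names (`make_oligo`, `tandem_probe`) are hypothetical choices
made only to reach the stated length.

```python
# Binding site quoted in the text for H2TF1/NFkB.
SITE = "GGGGATTCCCC"


def make_oligo(site: str, length: int = 25, filler: str = "A") -> str:
    """Center `site` in an oligo of `length` bp.

    The filler base is a placeholder: the real flanking sequence
    is not given in the source.
    """
    pad = length - len(site)
    if pad < 0:
        raise ValueError("site is longer than the requested oligo")
    left = pad // 2
    return filler * left + site + filler * (pad - left)


def tandem_probe(oligo: str, copies: int) -> str:
    """Head-to-tail concatenation of `copies` identical oligos."""
    return oligo * copies


monomer = make_oligo(SITE)
trimer = tandem_probe(monomer, 3)

for name, probe in (("1-mer", monomer), ("3-mer", trimer)):
    print(f"{name}: {len(probe)} bp, {probe.count(SITE)} binding site(s)")
```

Under these assumptions the 3-mer is 75 bp long and carries three copies
of the site. The 3-5 fold signal difference reported in the text is an
experimental observation and is not modeled here.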

Enhancement of the signal with a 
multi-site probe may be due to the fact 
that such a probe can simultaneously 
interact with two or more immobilized 
protein molecules, thereby increasing 
the overall stability of the protein-DNA 
complexes (see below). This type of
DNA probe is particularly suitable for
the isolation of clones encoding DNA
binding proteins with low affinity for
their recognition sites. Given a number
of examples in which a multi-site DNA
probe increased the detection signal, it
is clearly the preferred type of probe.

Nonspecific competitor DNA. The
addition of an excess of nonspecific
competitor DNA in the probe solution
reduces background as well as
minimizes the detection of recombinant
phage encoding non-sequence-specific
DNA binding proteins. Several
different competitor DNAs have been
used to successfully screen expression
libraries (33,44,46). Screens of such
libraries with poly(dI-dC)·poly(dI-dC)
as the nonspecific competitor DNA
yielded some recombinant phage that
encoded proteins which preferentially
bind single-stranded DNA (44). As
shown in Figure 5, the signal from such
phage (e.g., λh1), but not from phage
encoding sequence-specific DNA
binding proteins (e.g., λh3, which
encodes H2TF1/NFκB), could be
efficiently blocked with sonicated and
denatured calf thymus DNA at a
concentration of 5 µg/ml. The latter DNA
further reduced the background signal
from the filters. Based on the results of
Figure 5 and given that several clones
encoding sequence-specific DNA
binding proteins have been
successfully isolated using sonicated and
denatured calf thymus DNA (e.g., Oct-2,
MLTF and E12, see Table 1), this
nonspecific competitor DNA is preferred.

Binding and wash conditions. The
equilibrium association constants of
sequence-specific DNA binding
proteins range over many orders of
magnitude (10⁸-10¹² M⁻¹). Consideration
of the equilibrium and kinetic constants
of a protein-DNA interaction in
solution suggests that successful
screening may be restricted to proteins
with relatively high binding constants,
since only these are likely to form
complexes with half-lives long enough to
withstand the wash protocol (44). For
example, if a regulatory protein has an
association constant of 10¹⁰ M⁻¹, then
under the screening conditions (the
DNA probe is in excess and at a
concentration of ca. 10⁻¹⁰ M),
approximately half of the active
molecules on the filter will have DNA
bound. Since the filters are
subsequently washed for 30 min, the
fraction of protein-DNA complexes that
remain will be determined by their
dissociation rate constant. Assuming a
diffusion-limited association rate
constant of 10⁷ M⁻¹ s⁻¹ (1), the
dissociation rate constant in solution
will be 10⁻³ s⁻¹. This rate constant
translates into a half-life of
approximately 10 min. Thus, one-eighth
of the protein-DNA complexes should
survive the 30 min wash. For a binding
constant of 10⁹ M⁻¹, about a tenth of
the active protein molecules will have
DNA bound, but virtually all of this
signal should be lost since the
half-life of these complexes in solution
is approximately 1 min. It is unclear
whether the equilibrium and kinetic
constants of a protein-DNA interaction
in solution accurately describe the
binding of a DNA probe to a matrix of
protein immobilized on a filter. Thus,
it may be possible to isolate
recombinants encoding proteins with
binding constants of 10⁹ M⁻¹ or lower.
The sensitivity of detection of a phage
encoding a low affinity variant of the
Oct-2 protein is markedly enhanced by
using a DNA probe containing multiple
binding sites (46). Since the
association constants of DNA-binding
regulatory proteins are dependent on
ionic strength, temperature and pH,
these parameters can be manipulated in
the binding and wash steps to optimize
the detection of a relevant recombinant
protein. Finally, if the DNA binding
protein being cloned has an exogenous
metal ion requirement (e.g., Mg²⁺), the
binding and wash buffers should be
appropriately supplemented.
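
The wash-survival estimate in the preceding paragraph can be reproduced
with a few lines of Python. This is only a sketch of the solution-phase
reasoning, assuming a probe concentration of 10⁻¹⁰ M, a
diffusion-limited association rate constant of 10⁷ M⁻¹ s⁻¹ and a 30 min
wash; the function names are illustrative. As the text notes, solution
constants may not describe binding to protein immobilized on a filter.

```python
import math

PROBE_CONC = 1e-10   # M, probe in excess (assumed from the text)
K_ON = 1e7           # M^-1 s^-1, diffusion-limited association rate
WASH_MIN = 30.0      # duration of the wash in minutes


def fraction_bound(k_a: float, probe_conc: float = PROBE_CONC) -> float:
    """Equilibrium fraction of active protein with DNA bound."""
    x = k_a * probe_conc
    return x / (1.0 + x)


def dissociation_rate(k_a: float, k_on: float = K_ON) -> float:
    """k_off = k_on / K_a, in s^-1."""
    return k_on / k_a


def half_life_min(k_off: float) -> float:
    """Half-life of the complex in minutes for first-order decay."""
    return math.log(2) / k_off / 60.0


def surviving_fraction(k_off: float, wash_min: float = WASH_MIN) -> float:
    """Fraction of complexes remaining after the wash."""
    return math.exp(-k_off * wash_min * 60.0)


for k_a in (1e10, 1e9):
    k_off = dissociation_rate(k_a)
    print(
        f"K_a = {k_a:.0e} M^-1: bound = {fraction_bound(k_a):.2f}, "
        f"k_off = {k_off:.0e} s^-1, "
        f"half-life = {half_life_min(k_off):.1f} min, "
        f"surviving = {surviving_fraction(k_off):.2e}"
    )
```

Computed without rounding, the 10¹⁰ M⁻¹ case has a half-life of about
11.6 min and roughly 17% of the complexes survive the wash, close to the
one-eighth obtained when the half-life is rounded to 10 min. For
10⁹ M⁻¹ the surviving fraction is negligible.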

CHARACTERIZATION OF 
RECOMBINANT DNA BINDING 
PROTEINS 

After the isolation of a recombinant 
phage that is specifically detected with 
a given binding site probe, but not with 
control DNAs, it is necessary to 
demonstrate that this clone encodes a 
recombinant protein of the expected 
DNA binding specificity. In the case of
a λgt11 recombinant, this is simply
achieved by isolating lysogenized E.
coli clones and assaying extracts of
induced lysogens for a β-galactosidase
fusion protein that specifically binds
the recognition site probe used in the
screen (see accompanying protocol).
Chemical and enzymatic footprinting
in conjunction with the analysis of
mutant binding sites are required to
rigorously characterize the DNA
binding specificity of the recombinant
protein. The criteria used to relate a
recombinant protein cloned by this
strategy with a previously
characterized native protein are
discussed in the following section.

DISCUSSION

In this article we have reviewed the 
development of a new strategy for the 
molecular cloning of sequence-specific 
DNA binding proteins. This strategy
circumvents purification of such a
DNA binding protein for the purpose
of isolating its gene. It simply requires
a cDNA library constructed in the
phage λgt11 and a DNA recognition
site probe. As a result of its simplicity
and its potential to isolate rare cDNA
clones, this strategy is expected to
greatly facilitate the analysis of
proteins that regulate transcription, DNA
replication and site-specific
recombination. In fact, within a year of
its introduction, more than ten cDNA
clones that encode distinct
transcriptional regulatory proteins have
been isolated using this strategy (see
Table 1).

The DNA binding domains of a 
large number of regulatory proteins 
contain either a helix-turn-helix motif
or the "zinc finger" motif (10,15,36,
41). Clones encoding proteins with
either of these structural motifs can be
detected by in situ screening with the
relevant recognition site DNAs. The
protein encoded by the H2TF1/NFκB
cDNA clone contains two "zinc
fingers" in its DNA binding domain
(Baldwin, LeClair, Singh and Sharp,
unpublished results). In contrast, the
Oct-2 and Oct-1 cDNA clones encode
proteins with a predicted
helix-turn-helix motif (7,34,46,47).
Thus, the screening method appears not
to be restricted to a sub-class of DNA
binding domains.

Many sequence-specific DNA bind- 
ing proteins are functional homodi- 
mers. The binding sites of these 
proteins exhibit two-fold rotational 
symmetry. In these cases the affinity of
the monomer for the complete binding
site is significantly lower than that of
the dimer (37,41). Clones encoding
such homodimeric proteins can also be
detected by in situ screening. The
bacteriophage λ O protein appears to
bind its dyad symmetric recognition
site in oriλ DNA as a dimer (Roberts and
McMacken, personal communication). A
clone encoding this protein can be
specifically detected by in situ
screening of bacterial colonies using
oriλ DNA as probe (see Figure 3). The
mammalian protein C/EBP also
appears to require dimerization for
sequence-specific binding (Landschulz,
Johnson and McKnight, personal
communication). A λgt11 recombinant
encoding this protein can be detected by
screening plaque lifts with the
corresponding DNA binding site probe
(48). Interestingly, the region of C/EBP
required for dimerization, the "leucine
zipper," is shared by a number of
regulatory proteins including GCN4,
Fos, Myc and Jun (28). Recently,
Murre et al. (35) have used the
screening strategy described herein to
isolate a clone encoding an enhancer
binding protein (E12, Table 1) that
requires a new type of dimerization
domain for DNA binding. These
examples clearly show that clones
encoding proteins that bind DNA as
homodimers, using different
dimerization domains, can be
successfully screened as a consequence
of their functional expression in E. coli.
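
Two-fold rotational (dyad) symmetry of a binding site means that the
sequence reads the same as its reverse complement. A minimal Python
check is sketched below; the function names are illustrative, and the
EcoRI site GAATTC is used only as a familiar palindromic example that
does not come from this article.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def is_dyad_symmetric(seq: str) -> bool:
    """True if the sequence equals its own reverse complement."""
    return seq.upper() == reverse_complement(seq)


def symmetry_score(seq: str) -> float:
    """Fraction of positions that match the reverse complement."""
    s = seq.upper()
    rc = reverse_complement(s)
    return sum(a == b for a, b in zip(s, rc)) / len(s)


for site in ("GAATTC", "GGGGATTCCCC"):
    print(site, reverse_complement(site),
          is_dyad_symmetric(site), round(symmetry_score(site), 2))
```

An odd-length site such as the 11 bp sequence quoted earlier in the
article cannot be perfectly symmetric, because the central base cannot
pair with itself; the score then reports how close the two half-sites
are to a perfect dyad.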

Most functional DNA binding 
domains, including elements required 
for dimerization, are contained within
relatively small protein segments
(approximately 60-200 amino acids, e.g.,
the DNA binding domains of EBNA-1
(38), GAL-4 (26), GCN-4 (20), Sp1
(25)); therefore, successful screening is
not dependent on full-length cDNA
clones. It simply requires that a given
expression library contain partial
cDNA clones spanning the DNA
binding domain of the desired protein.

The screening strategy, although a 
very powerful tool, has limitations. 
Since it relies on functional expression 
of a DNA binding domain in E. coli, it 
is highly unlikely to enable the cloning 
of proteins that depend either on a
cell-specific post-translational
modification or a second distinct
subunit for high affinity DNA binding.
In the case of heterodimeric proteins
(6,16), one of the two subunits may bind
the recognition site with an affinity
that makes the isolation of its gene by
in situ screening feasible. For example,
in the AP-1 and c-Fos complex, c-Fos
confers high affinity binding, but the
AP-1 subunit alone binds the same
recognition site with detectable
affinity (8,17). Given the clone for one
subunit of a heterodimeric DNA binding
protein, it may be possible to clone the
gene encoding the second subunit by
using a variation of the expression
screening approach (see below). Another
limitation of this strategy is that,
initially, a recombinant protein can
only be related to a previously
identified native protein by comparing
the DNA binding specificities of the
two. However, in a situation where
multiple DNA binding proteins
recognize the same sequence, this
criterion is very difficult to apply
(21,44). Eventually, direct structural
analyses are necessary to resolve this
issue. Antibodies generated against the
cloned protein permit the detection of
shared antigenic determinants (47).
Peptide mapping performed on analytical
amounts of the native and cloned
proteins constitutes a definitive
structural comparison (7). A third
limitation of this strategy is that its
application can result in the isolation
of recombinant phage whose
β-galactosidase fusion proteins do not
appear to bind DNA with detectable
affinity in solution (T. Kristie,
personal communication).

The strategy of cloning a gene on 
the basis of detection of its functional 
recombinant product with a ligand
probe has considerable potential. It
may be possible to use different types
of ligands, including RNA recognition
sites, hormones, protein subunits (e.g.,
a subunit of a heterodimeric DNA
binding protein), nucleotides, metal
ions etc., to directly clone genes that
encode the relevant proteins. During
the development and application of the
strategy reviewed in this article, a few
ligand-mediated screens of this type
have been described. The cDNA for a
calmodulin-binding protein has been
cloned using iodinated calmodulin as a
probe to screen a λgt11 expression
library (43). Similarly, λgt11 clones
expressing the regulatory subunit of a
cAMP dependent protein kinase could
be detected by an in situ screen of a
library using ³²P-labeled cAMP as a
probe (27). Finally, mutants of ras that
are defective in GTP binding have been
isolated by using ³²P-labeled GTP in
an in situ colony-binding assay (11).
Thus, the principle underlying this
strategy appears quite general in its
scope.